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ABSTRACT 

Blocking (Tiling) techniques of iteration spaces to 
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increase the critical blocking factor with no self- 
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algorithms were performed which showed a consistent superior 
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I. INTRODUCTION 


A. PERFORMANCE IMPROVEMENT IN A COMPUTER 
1. The Memory Bottleneck 
An essential component of every computer 1s its 


memory. Without memory there could be no computers as we now 


know them. There 1s an important axiom of hardware design: 
smaller is faster [Ref. 18]. Smaller pieces of hardware will 
generally be faster than larger pieces. In high-speed 


machines, signal propagation is a major cause of delay, and 
larger memories have more signal delay and require more levels 
to decode addresses. A basic attribute used to measure the 
effectiveness of a memory configuration is the memory band- 
width. This is the number of words that can be accessed per 
second or the rate of information transfer. Increasing the 
computational power without a corresponding increase in the 
memory bandwidth of data to and from memory can create a 
serious bottleneck [Ref. 18]. In other words, increasing 
memory bandwidth and decreasing the latency (execution time) of 
memory access are both crucial to system performance. Since 
these two measures are so much important, we should improve 


them to obtain a better performance. 


2. Cache Memory 

The analysis of a large number of computer programs 
has shown that during program execution, memory references 
tend to occur in very localized patterns. Consider, fam 
example, a block of contiguous memory locations. Tf thee 
block consists of straight line code, then execution will 
proceed sequentially through the block. If the block repre- 
sents a data array, then it is likely that the block will 
often be used, not necessarily in a sequential order, but at 
least as a whole. This characteristic of programs is referred 
to as the locality of reference principle. When a program is 
stored 1n main memory, the access time for fetching a memory 
word is a function of the capacity(size) of the memory and not 
a function of the size of a local block in the memory [Ref. 
19). Hence, to get the benefit from the locality of reference 
principle, high-speed buffer(s) can be inserted between the 
processor(s) and main memory to capture those active portions 
(blocks) of the contents of main memory currently in Gece 
Then, rather than fetching the word at the next location from 
the main memory, it could be fetched much more quickly from 
this special high-speed buffer called cache memory and 
implemented in most computers as shown in Figure 1. The major 
reasons for having a memory hierarchy are to get a better 
performance and to reduce execution time, not misses [Ref. 
18). If cache misses can be reduced by some means, then the 


overall performance of processors will improve. 





small fraction of a matrix. Thus even if the data are reused, 
they may have been displaced from the cache by the time they 
are reused, causing a high miss ratio in the cache. For this 
reason, blocking techniques are utilized for the goal of 
reducing memory traffic and exploiting this data reuse. 
Blocking techniques restructure the code to move references to 
the same memory location closer together, hence improving 
cache performance. Blocking can be applied:at different 
levels of the memory hierarchy, including physical memory, 
caches, and registers [Ref. 9]. In the experiments, blocking 


has been applied both at data cache and register levels. 


B. IMPROVING DATA LOCALITY 
1. Types of Misses in Cache 
Cache misses occur in one of three forms: 


* Compulsory misses: The first access to a block is not in 
the cache, so the block must be brought into the cache. 
These are called cold start misses or reference misses. 


* Capacity misses: If the cache cannot contain all the 
blocks needed during a program execution of a program, 
Capacity misses will occur due to blocks being discarded 
and later retrieved. 


* Conflict misses: If the block-placement strategy 1s set 

associative or direct mapped, conflict misses (in addition 

to compulsory and capacity misses) will occur because a 

block can be discarded and later retrieved if too many 

blocks map to its set. These are also called collision 
misses [Ref. 19]. 

Conceptually, conflict misses could be the easiest to 

avoid: Fully associative placement avoids all conflict misses 

by mapping an address in the main memory to any cache block. 


But, this associativity 1s expensive in hardware and may also 
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slow down the access time, leading to a lower overall perfor- 
mance [Ref. 19]. Therefore, fully associative caches are not 
built. Capacity misses can be reduced by using a large cache. 
Larger cache lines reduce the number of compulsory misses, but 
this may lead to an increase in conflict misses. 

2. Transformation (Restructuring) 

If a program can be transformed into a version that 
makes better use of the cache, then the number of requests to 
memory can be reduced, thus improving the execution time. 
Program transformations that move consecutive uses of a memory 
location closer together can increase the number of times a 
value is used before it 1s replaced. A cache miss removed by 
code transformation will require less traffic since the number 
of times a value is loaded into the cache is decreased [Ref. 
eal. 

The basic theory of this transformation (restructur- 
ing) process is based on data dependence analysis. Data 
dependence information 1s needed to test whether restructuring 
transformations are legal (the program produces the same 
answer after restructuring as it did before) [Ref. 13]. Data 
dependence analysis will be explained in detail in Chapter II. 
Peiretly, it 1S a tool which is applied to partition a serial 
program into blocks of code that contain well-defined depen- 
dences. The purpose of dependence analysis 1s to prove the 
presence or absence of data dependence between the blocks 


[Ref. 16]. These blocks may be converted into independent 


tasks, scheduled onto one or more parallel processors, and run 
as a parallel program. The problem gets complex and difficult 
when the blocks are control structures such as loops, 
branches, and procedures. 

Loops are difficult to analyze, and yet they are the 
best candidate for optimizations obtained by transformation 
techniques such as vectorization, fusion, coalescing, distri- 
bution, interchange, node splitting, shrinking, unfolds 
skewing, and blocking transformation techniques [Ref. 16]. 

3. Performance Experiments with Blocked Codes 

In this thesis, we experiment with techniques to 
improve data cache performance with blocked codes. We apply 
these techniques on two machines: DECstation 3100 and 
Solbourne S4000 SPARC station. Together with blocking, we 
investigate the effect of further optimizations such as 
unfolding (unrolling) on the data cache performance. 

Blocking techniques are used to increase the life time 
of the piece of data in the cache. Therefore, this allows 
reused data to still be in the cache and reduces the memory 
accesses. Blocking techniques improve cache performance by 
restructuring the code to move references to the same memory 
location closer together, hence eliminating cache misses. To 
1llustrate the benefits of blocking, we handle matrix multi- 
plication code, showing how it is blocked and how the reuse of 


data in that piece of code can be exploited. A basic matrix 


multiplication computes an inner product of a row and a column 


peice rices B ana C for each element of the result matrix A. 


BOF (Lf = 17 fa ee.) 7 
Dao Th oI? ieee 
RCo fg eee) at 


ee ame, EK); 


Figure 2 


Main loops of Matrix Multiplication 


We assume that we have a direct-mapped data cache and the 
arrays are stored rowwise in cache lines. If we substitute 
some loop bounds for the code above, then we can show the 
potential in it for the reuse easily. In Example 1 below, 


matrix multiplication code is utilized and N has the value 2. 


Example 1: 


Oe) = A ae eee eee fl = 1, Jd ={)> 23s 
C lymtjyt= All, 1) *GBiehry2) 
Cufelon elle += A (1 ze (2 ee fT =1, 0-2 ee 


Cet ee ce ee ee ee el ne enn ee ne ti i Onn ee ee 


SS ee ee 


As seen above, there are array variables which are repeatedly 
referenced and retrieved from the memory. For example, both 
A(l, 1) 1s referenced when K = 1 for different iterations of 
J loop. A(2, 2) present the same behavior when K = 2. All 
these repeating references cause unwanted cache traffic (if 
the cache is not large enough to hold the whole A array). 
Here we assume that we have a data cache size of 2° = 4 words 
and the reused array elements in the first four lines of 


Example 1 are placed in this cache as shown in Figure 3. 





data cache 


main memory 


Figure 3 


Data Cache and Main Memory 


To provide this type of placement in the cache, we can use 


Ssubblocks of the arrays as follows: 


ciel = Ale} fe Bi! ep A}-é sk Be! 


Gia = A} * Bi:2 ie A}-2 * Be’? 


5 
Cer = Ae} ae Bl a Aer? * BRB?! 


to 


Cee = A?) * BY i Re? * B*' 


The following figure shows these subblocks explicitly. 





Figure 4 


Subblocks in a matrix multiplication code when 
matrix size is 2 

This subblocking exhibits the advantage that the 
blocks can be sized to fit into the fastest level of the 
memory hierarchy such as cache memory, and by this way the 
data in each submatrix 1S used many times during each matrix 
multiplication. These benefits of subblocking can be enhanced 
vila program transformations which form the basis of blocking 
techniques. The main loops of matrix multiplication code in 


Figure 2 are blocked and shown in Figure 5. 
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Cea e= 0; 
temp = A[I][K]; 
Me = man (N,JI+B) ; 
hom (J=0g stoi; ++7){ 


Cee e lt iicies Cemp "Bi kita]; 


Figure 5 


Main Loops of Blocked Matrix Multiplication 


4. Previous Work on Blocking 
It has long been known that program restructuring can 


dramatically reduce the load on a memory hierarchy subsystem 


({1], [2], [3]). Previous research on blocking focused on how 
to block an algorithm manually and automatically ({[{5], [6], 
[7], [8]). Irigoin and Triolet [Ref. 4] describe a procedure 


to partition the iteration space of a tightly-loop into 
sSupernodes, where each supernode covers a set of iterations 
that will be scheduled as an atomic task on a processor. That 
procedure works from a new data dependence abstraction, called 


Wa 


the dependence cone. The dependence cone 1s used to find 
legal partitions and to find dependence constraints between 
supernodes. 

Lam, Rothberg, and Wolf [Ref. 9] show that the degree 
of cache interference is highly sensitive to the stride of 
data accesses and the size of the blocks, and can cause wide 
variations in machine performance for different matrix sizes. 
They presented cache performance data (obtained via 
Simulation) for blocked programs and evaluate several 
optimizations such as using a fixed blocking factor and 
copying non-contiguous data to be reused into a contig-uous 
area. They found that trying to use the entire cache, or even 
a fixed fraction of the cache 1s not so useful and propose an 
algorithm to tailor the block size according to matrix size to 
improve the average miss rate. 

Hong and Kung [Ref. 10] claim that the optimal 
blocking factor is roughly SORT(C) for matrix multiplveaeaem 
on a machine with a local memory of C words. 

In an algorithm implemented in Stanford University 
Intermediate Format Compiler [Ref. 8], the locality of a loop 
nest by transforming the code via interchange, reversal, 


skewing, and blocking 1s improved. 


C. ORGANIZATION OF THE THESIS 

In Chapter II, we focus on existing blocking techniques, 
by explaining their advantages and related terminology. We 
also present data obtained from the experiments with blocked 
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codes to observe the effectiveness of these blocking tech- 
niques. In Chapter III, we propose an enhanced blocking 
technique which improves the performance of workstations in 
some scientific applications. Conclusions and recommenda- 


tions for further research are offered in Chapter IV. 


AS: 


II. EXPERIMENTING WITH BLOCKING 


A. BLOCKING TECHNIQUES 
1. Program Restructuring 
Blocking techniques improve cache performance by 
restructuring the code, hence eliminating cache misses. 
Advanced compilers or supercompilers are capable of many 
program restructuring transformations such as vectorization, 
Strip mining and loop interchanging Siker- eee ae These 
transformations transform a program into a version that makes 
better use of the cache, thus the number of requests to memory 
is reduced and the execution time 1S improved. But super- 
compilers cannot transform every program efficiently; the 
quality of the results will depend on the structure op ye@e 
algorithm and the architecture of the target machine. Data 
dependences [Ref. 21] imply precedence constraints among 
computations which have to be satisfied for a correct 
execution. So, when determining the loops for restructuring, 
the existence of these dependences must be considered. In the 
next sections, concepts such as data dependence, data 
dependence graph and iteration space will be described. 
a. Data Dependence 
Two types of dependence occur in computer 
programs. Control Dependence 1s a consequence of the flow of 


control in a program. Execution of a statement in one path 
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under an if test 1s dependent on the if test taking the path. 
Ravic, the Statement under control of the if is control 
dependent upon the if test. Data Dependence 1s a consequence 
of the flow of data in aprogram. The value of an expression 
1s dependent upon the values of the variables used in the 
expression. Therefore, a statement which uses a variable in 
an expression 1s data dependent upon the statement which 
computes the value of the variable. 

In programming languages, such as C, Fortran, and 
Pascal, three kinds of data dependence may occur: flow- 
dependence(or true dependence), anti-dependence, and output- 
dependence. The first dependence relation occurs when a value 
computed(stored) in a statement S, 1s used(fetched) in some 
statement S,; we say that S,1is data flow-dependent on S, and 
write this as S,&S,. This type of data dependence relation 
Shows how the data flows between the statements of the 
program. The second kind of data dependence occurs when an 
item is used in statement S, before that item is reassigned in 
some statement S,; we say that S, 1s data anti-dependent on S, 
before that item 1s reassigned in some statement S, and write 
this as S, & S,. The last kind of data dependence occurs 
when an item is assigned in statement S, before that item is 
reassigned in some statement S,; we say that S, is data output- 
dependent on S, and write this as S, &° S, [Ref. 21]. We 


illustrate these dependencies in Example 2. 


i 


Example 2. 


Sl A = BF 
Sys Cae 
So Aa eee. 
S4 E = A 72 


The dependence relations for this code are: 

S2,53,S4 are data flow-dependent on S1,S2,S3 respectively; 
S3 is also data flow-dependent on Si; 

S3 is data anti-dependent on S2; 


S3 is data output-dependent on Sil. 


b. Dependence Graph 

A dependence relation is a precedence relation 
which requires execution of one statement before another. A 
parallelizing (or optimizing) compiler can build a dependence 
graph by using these dependences. In a dependence graph, 
nodes represent statements in the program and directed arcs 
represent dependence relations [Ref. 21]. The real advantage 
of dependence graphs is their ignoring the arbitrary ordering 
of statements and concentrating on the dependence precedence. 


The dependence graph for Example 2 1s depicted in Figure 6: 


ILis 





Figure 6 


Dependence Graph for the code Example 2 


c. Iteration Space 

A loop, when there is considerable potential for 

code optimization, can be said to describe an iteration space. 
A single for loop describes a one-dimensional iteration space, 
one axis of a Cartesian coordinate system. Each iteration of 
the for loop corresponds to a point along this axis. The for 
loop will visit the points along this axis in a specific 
order, as defined by the for statement. If we have two nested 
for loops, then they describe a two-dimensional iteration 


space. 


iy 


for (i, = 17 1257777 ae 


for (1, = 1; 1524753.) 9 


oars Falcons ala) = B(l;, 15) + Gee 
Ss, B(I,, I,,,) = A(I;, 1;) ae ey 
} 
} 
Figure 7 


Two-nested serial loops 


The loops above define a two-dimensional 5X4 iteration space 


illustrated in Figure 8. 
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Figure 8 


Iteration space for the code in Figure 7 


The shape of the iteration space does not need to 
be rectangular. The iteration space for the loops below is 
triangular, so it 1s called as a triangular loop. Other 


iteration space shapes can be defined by nested loop. 


18 


fOr (1-2) ee Fel) 
FOr (a ee Oe rt) oy 


Noga e ag) ee (apy) DD (1,7) 


Figure 9 


Triangular Loop 


In this research, a loop transformation technique 
blocking (iteration space tiling) is implemented on two 
different machines. The code used in the experiments 1s 
Seeimized by utilizing blocking optimizations to get better 
speedups. We also utilized another optimization technique 
eabled Joop unfolding on the blocked code but it did not help 
the performance of the data cache. Blocking utilizes two 
other transformation techniques, loop interchanging and strip 
mining. 

d. Loop Interchanging 

One of the most important restructuring transfor- 
mations is loop interchanging. The simplest example of loop 
interchanging is interchanging two-nested loop, such as the 


loop below: 


Ik 


for {i = 1; ia) 42a 
for (J = sve N- eee 


A(I,J+1) = A(I,J) * BUl) Ah) Vee Geese 


Figure 10 


Two-Nested Loops 


The nested loop above is a first order linear recurrence; most 
vector computers have no corresponding instruction, so the 
loop would be executed serially, or with some fast recurrence 


algorithm. By interchanging the loops: 


FOr (J = 1 Den See 
for (i = 2- 7i- 27a 


Af{i,J#1i) =A(ti,d) * Bil, kh) + vege 


Figure 11 


Two-Nested Loops After the Interchange 


The inner loop is vectorizable. Not all loop interchanges are 
legal. For instance, the loops below can not be interchanged: 
For (1c2<2 > 1a +e 
for {J =.1/7) cael soe 


A(i,J) = A({i-l, Tl) yee 
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The original loop uses newly computed values of the array Aon 
the right hand side; if the loops are interchanged, the 
transformed loop will use only old values of A on the right 
hand side. The transformed loop will compute different 
results and the transformation is therefore illegal. 
e. Strip Mining 

Vectorizing compilers often divide a single loop 
into a pair of loops, where the maximum trip count of the 
inner loop is equal to the maximum vector length of the 
machine. The loop in Example 3.a. can be converted into a 
Meir Of the loops in Example 3.b by a vectorizing compiler. 
This process is called strip mining. The original loop is 
divided into strips of some maximum size, the strip size, in 
Example 3.b., the inner loop (or element loop) has a strip 
size of B, which is the length of the vector register in the 
compiler. tiem oueer oop (1S loop, for strip loop) steps 
between the strips; 
Example 3.a. 

eet foe pao +e) Y 


Air yoo ( G) 


Sve A(T) 


Sy : Gera ik—-1) *" 2 
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Example 3.b. 
for (1S = 17 fee Ss eee 
for (I = IS; I=emin(N, IS + Bye ee 
=) ae AVE) = ee) eee 


San Cli) V2 Alr-1) ae 


2. Blocking (Iteration Space Tiling) 
a. Blocking Algorithms 

Blocking algorithms have been developed for the 
goal of reducing memory traffic. Its advantage is that the 
blocks can be sized to fit into the fastest level of the 
memory hierarchy, and that during each computation, the data 
in each block is used many times. Blocking tries to duplicate 
the benefits of block algorithms via program transformations. 

D. © Blocking (Tiiing) 

We define Blocking as dividing the iteration space 
into blocks (tiles) of some size and shape (typically squares 
or cubes), and traversing between the blocks to cover the 
whole iteration space. Optimal blocking for a memory 
hierarchy will find blocks such that all the data for each 
block will fit into the highest level of the memory hierarchy 
[Ref. 13]. This will reduce the number of intervening 
1terations and data fetched between data reuses. The reused 
data will still be in the cache, and hence memory accesses 
will be reduced. In other words, the traversal between the 


ay 


blocks will follow an order that will reduce the amount of 
data that needs to be moved when going to the next tile. To 
be most effective at blocking (tiling), block size must be 
tuned to fit into cache memory to allow the minimum number of 
misses to occur and generate the least traffic between the 
memory levels. 

Blocking, as a combination of strip mining and 
loop interchanging, can improve codes. The goal is to strip 
the innermost loop into pieces such that the new innermost 
loop fits entirely into cache memory. The newly created 
middle loop is then moved to the outer loop. Figure 12 below 


illustrates how blocking is applied to loops whenever it 1s 


legal. 
for (I = 1; I<N; ++I) { 
FOr] 177<N> +40) 4 
J Clo loro org 
} 
} 
becomes 


fOr oo emi WIN; JS = JJ + B) { 
formes = 1; -feN; +41) { 
for = ys JI<B, +47) -{¢ 


Jielejoleyerotia 2) 7 yaw, 


Figure 12 
Blocking. B is the strip size 


ZS 


A further optimized version of the code in Figure 12 is 
illustrated in Figure 13. The inner two loops iterate in 


square shaped blocks (BxB) while the outer two loops step 


between the blocks. 


for (fil = 1; fi<N; 2] eee ae 
for (JJ = 17 dd<N; Jd 2 Wo a ae 
for (I = If; T<min(N,; 2! 7 78)0 2 ee 
for (J =—adae ace B); +4+J) { 


HVeleye: iersleny: 


Figure 13 


Further Optimized version of the code in Figure 12 


The iteration space 


for the above code ie 


is shown in Figure Be ee 


ioe 





Figure 14 


Iteration Space for the code in Figure 13 
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B. PERFORMANCE EXPERIMENTS WITH BLOCKED MATRIX MULTIPLICATION 
CODE 


The experiments are performed on two different machines 
located in The Naval PostGraduate School. Matrix multiplica- 
tion is utilized as the main source code written in C 
language. Matrix multiplication is a building block in many 
linear algebraic algorithms. It is also an interesting case 
study because locality 1s carried in three different loops by 
three different variables. The entire Sen ete sab x 
Multiplication involves 2N* arithmetic operations (counting 
additions and multiplications separately), but produces and 
Sencumes Only 3N* data values. As a whole, the computation 
exhibits “admirable reuse of data" [Ref. 23]. In general, 
however, an entire matrix will not fit ina small data cache 
memory. So, the code is restructured such that the necessary 
reuse of data 1S achieved in cache. 

1. Experimental Setup 

In our experiments, three different matrix sizes (293, 
mos, 300)* are utilized and the performance of blocked matrix 
multiplication is observed by experimenting with square shaped 
blocks having sizes of multiples of 8. To ensure the accuracy 
of the timing results obtained, each experiment has been run 
5 times by using the library function system() which provides 


access to operating system commands. 


‘These values were chosen so that we compare our results to 
those in [{Ref. 9]. 
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The "MFLOPS" rating of the machine was Ut1l1zeamcs sare 
performance measure for the blocked portion Gf the Code years 
the loops. We had to find the time of that portion. The 
command "time" provided in C compiler is not appropriate tc 
measure for that type of computation. Because, i1t 1s an 
executable program available when using the shell and it 
measures the time of an executable program from the beginning 
to the end. So, in our experiments, another structure called 
getrusage 1s used. "getrusage" returns the user time utilized 
by the current process, or all its terminated child processes. 
By calling it twice, at the beginning and at the end of the 
nested loops, we measured the time we were looking for. The 


code used in the experiments is presented in Figure 15 below. 


float Timingl, Timing2, Difference, Performance; 
Struct rusage “xX, *y77* decilerzae. one 


/* other declarations and lines of code will be here*/ 


x = (struct rusage *) malloc (sizeof(struct rusage)); 
y = (struct rusage *) malloc (sizeof(struces 7 usagoe, 
if (getrusage (RUSAGE_SELF, x) == -1) 


perror("getrusage()"); 
/* Nested loops are put here */ 
if (getrusage (RUSAGE_SELF, y) == -1) 
perror("getrusage()"); 
Timing] = (y->ru_utime.tv_usec - X->ru_utime.6V_ useay 200s 
Timing? = y->ru_utime.tv_usec - X-Sru_Utimeseve Kee 
if (Timanogl = 30per, 
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sibuneelog famang2 — i- 
Timingwew—=—. 000 + Timing! ; 
} 
Difference = Timing2 + Timingl / 1000; 


Performance = (2*N*N*N) / (1000000*Difference); /* MFLOPS */ 


Figure 15 


The code to compute the performance of the cache 


During the experiments, we needed to edit files 
quickly to change the values for matrix and block size. We 
intensely used the Unix editor, sed, to make changes to the 
files on the command line. Also, usage of Tschell (shell) 
together with sed allowed the task of making frequent changes 
to the source files by the processor less tedious. 

2. Overview of The Targeted Machine Architectures 

The two machines used in the experiments are Solbourne 
S4000 SPARC Station and DECstation 3100. The Solbourne S4000 
SPARC Station has a 64-bit SPARC CPU, with a 2kB 2-way set- 
associative write-back physical data cache (DCACHE) and 6kB 3- 
way set-associative instruction cache (ICACHE). The processor 
performs a load/store instruction in one clock, achieving 25.5 
MIPS. In SPARC microprocessor, all performance-critical 
element--integer CPU, floating point processor, memory 
Management unit and cache--are integrated on a single chip. 
The 64-bit memory bus supports a data transfer rate of 60 


Mbytes per second to accommodate the high performance 
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requirements of the processor. The DECstation 3100 has an 8kB 
double word direct-mapped cache and can hold all the words 
reused within an 88x88 block. For this reason, experiments 
are also done with block sizes over 88x88 in order to measure 
the real performance. Otherwise, the data cache can hold 
88x88 (or smaller) blocks and the effects of self-interference 
misses may not be observed. 
3. Blocking Experiments 

The main loops of unblocked and blocked matrix 
multiplication are illustrated in Figure 16 “anda 
respectively. 

sige agua ole 0) 5 Saino Were: 

ODI j=0, fa = eg ad 
for (2=07 4-<N- 2 yee 


cfaj{jj = ¢fijfj] + temp“ Sia 


Figure 16 


Main Loops of Unblocked Matrix Multiplication 


For WKK=07 Kk? hia ee 
fOr (37=05 FIaN> G3 =3 0 eee 
for (1=0; aN +e 
for (k=kk; k«<min(N/KKfS)- oe 


Cla io ae 


As 


temp = afi] [k]; 
MV =e (N, 77 #e)s 
Ore (5) 5) J) age ee 
ve wae yer Lemp DIK) {7 /7 


-_ ¢# e# # @ 


Figure 17 


Main Loops of Blocked Matrix Multiplication 


First, the code in Figure 17 is blocked at both data cache 
and register levels. af[i][k] is register allocated and this 
relatively increases the performance of blocked code. 
Secondly, min function is handled with a macro. In the very 
first experiments, the min function was placed in the for 


statement (as in Figure 18). 


pemem=— aii) [Kk]; 
OmmGi=—ig 7 Jim, ges); tog) 4 
eyes) = Cle ee eemp blk) [7]; 
Figure 18 


The for statement with min function in it 


Removing it as in Figure 19 improved the performance 


drastically. 


BS 


temp = afijfk]; 
M = min(N,j3j+8); 
fOr jsp, Jee ee 
cflijJ{j] = cfij{j] + tempt ayasere 
Figure 19 
The for statement without min function in it 
a. Cache Performance Experiments with Blocked Codes 
on Solbourne S4000 SPARC station 
In Figure 20, the performance of blocked matrix 
multiplication is plotted on Solbourne S4000 SPARC statwenme 
The code is compiled without enabling the code optimization of 
gcc (the gnu compiler). The graph in Figure 20 plotew@aa 
performance levels obtained for three slightly different 
matrix s1zes across a range of blocking factors. Blockiig 
effects are not observed when code optimization of the 
compiler is not enabled. There is a negligible change (<6%) 
between performances of blocked and non-blocked code when no 
optimization switch is used. The reason is that the number of 
instructions in the non-optimized code was much bigger than 
that of the optimized one and the percentage of memory access 
instructions to total memory instructions was very small. In 
other words, the code was running inefficiently. In contrast, 


optimized code had a smaller number of memory instructions 
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Figure 20 





The Performance of Blocked Matrix Multiplication on 
Solbourne S4000 SPARC. Compiler optimization para- 
meters are not enabled during compilation process. 
The abbreviation N.BL. means No BLocking 
and most of them were to access memory, amplying its 
efficiency. 

Figure 21 is similar to Figure 20 except that gcc 
compiler optimization was invoked. The code was compiled by 
mae Command “Ge -O2 source code:c " at the prompt. The 
performance on Solbourne S4000 SPARC station gets better when 
the block size is 16. Because 1t has an 2kB 2-way set- 


associative data cache and a word length of 4 bytes. This 


corresponds to a data cache size of 512 words. For a local 


ot 


memory of size C, the optimal blocking factor is rouge 
sqrt(cC) [9]. Since the data cache in SPARC station is two-way 
set associative, sqrt(512/2) is equal to 16 and therefore a 
block size of 16 x 16 results in a better performance. As 
observed in Figure 21, optimized blocked code with a block ume 
factor of 16 1s 40% better in performance than the one with no 


block ings 


MFLOPS 





2.2 
8 16 24 32 40 48 56 64 72 80 88 96 104 112 N.BL 
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Figure 21 


The Performance of Blocked Matrix Multiplication 
on Solbourne S4000 SPARC station. Compiler 
optimization switch -02 is enabled 
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b. Cache Performance Experiments with Blocked Codes 
on DECstation 3100 


Figure 22 shows the graph for the compiler 
optimized blocked code. Blocking optimizes the code and we 
also observe a better performance of blocking if the code is 
Compiler optimized for DECstation 3100. As shown in Figure 
22, Matrix sizes 295 and 300 present almost the same perfor- 
mance of the data cache and consistent increase until we reach 
Igbock Size of 7/72. At block size of 80, both of them show 
their highest performances. After this block size, no big 
change is observed in the plottings of these two matrix sizes. 
Matrix size 293 present a steadily increasing performance like 
the previous two. But, after achieving its best performance 
at block size of 40, it behaves differently and the perfor- 
mance for the data cache degrades drastically beginning from 
the block size of 56. The lowest performance for three matrix 
sizes 1s obtained when no blocking is applied. Because, the 
interference misses of the cache increases in this case. This 
means that the data which we can reuse are replaced by self- 
interfering array elements. 


c. Comparison of The Two Machines with respect to 
Speedup measure 


The two machines, DECstation 3100 and Solbourne 
S4000 SPARC station are compared with respect to speedup for 
meererent blocking factors in Figure 23. Speedup is calcu- 
lated by dividing each performance value by performance of 


unblocked code for the respective machine. Solbourne S4000 
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Figure 22 





The Performance of Compiler Optimized Blocked Matrix 

Multiplication on DECstation 3100. Compiler optimi- 

zation switch -0O2 is used for compilation process 
SPARC station shows the highest increase in performance at 
block size of 16 for three matrix sizes. This is related to 
the data cache size. But this increase does not last long and 
a decrease 1s observed after block size 16 which is the 
optimal blocking factor f6r SPARG astaewoene The speedup 
decreases consistently after blocking factor of 32. As the 
block size increases over this value, the performance of the 
data cache decreases due to self-interference. 

For matrix sizes 295 and 300, DECStation Jagm 

shows consistent performance increase as the blocking factor 
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Figure 23 


Comparison of DECstation 3100 and Solbourne Ss4000 
SPARC station with respect to speedup vs blocking 
factors for matrix sizes of 293, 295, 300 
increases. But, matrix size of 293 does not behave like the 
previous two. The performance of data cache gets worse after 
Bie blocking factor of 56. This shows that very similar 


matrix sizes may behave differently. Data cache in DECstation 


3100 displays better performance for a wide range of blocking 


BS 


factors where the data cache in Solbourne S4000"6PAR0C 6 oe 
performs its best just for blocking tactemeeums se 


d. Cache Performance Experiments with Blocked 
Unfolded Codes on Solbourne S4000 SPARC station 


In an attempt to furthermore increase in the 
performance of the data cache, experiments with loop unfolding 
(unrolling) technique were performed. Loop unfolding is the 
process of replacing the iterations of a loop with noniterated 
straight-line code [Ref. 16]. Shown in Example 4 are the 
codes for matrix multiplication without and “wate 
Hatoldand: The code is unfolded 3 times, where U is the 


variable to show how many times the loop is unfolded. 


Example 4. 
for (k=kk; k<min(N,KK+B); ++k) { 
temp = afi] [kK]; 
Mo SSeS 
for (93=77; 3<M?; 3=37+8) (92 eer intoleraewae 


cfijijj] += temp ~ bDigswiggs 


is unfolded to 
for (K=kk; k<min(N,kKk+B); ++k) f{ 
temp = afi] [k]; 
Mes. min{(N, 7942); 
if (M % U==0) { /* control statement for unstoiat7oae 


for (j=33; 3<M; j=3+U) { /* U=3, untoiced ae 
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een) aa meena ~ DTK] {ply 
So jpg emo = Df/k] ff y+1)-; 


Gog ijtei 2a eeeme 4 bf k]ip+2 i; 


else { 
G2 (5) 5) jag SUI ae 


elie eecemo * bik] [ji]; 


The data obtained from the experiments with unfolded blocked 
Matrix multiplication code shows that the overhead on the 
control statement plays an important role in reducing the 
performance. So, blocked codes with and without unfolding 
achieve negligibly different performances. Figure 24 
illustrates this cache behavior for the unfolded 8 times. 
Unfolding did not improve the performance on the blocked code 
because of the control statement overhead. The degree of 
unfolding the loop iterations affects the performance 


eserghtly. 
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Figure 24 





Performance for a blocked and 8 times unfolded 

matrix multiplication code, Solbourne S4000 

SPARC station 
The graph in Figure 25 plots the performance levels for 
blocked matrix multiplication code with 16 and 8 times 
unfolded. As presented in Figure 25, the performances of 16 
and 8 times unfolded codes are not extremely different from 
each other. This again shows that the control statement (if 
statement) in the code before the unfolded lines of loop 
iterations is the main reason of the overhead, rather than the 


tterations which are not wanted ec 
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Figure 25 


Performance for a blocked matrix multiplication code 
unfolded 16 and 8 times, Solbourne S4000 SPARC station 
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III. ALGORITHMS TO IMPROVE DATA CACHE PERFORMANCE 


Ax SELF-INTERFERENCE IN CACHE 

A reused variable of an array will miss in the cache if 
the references between reuse occupy the same cache location, 
thus leading to self-interference misses. If the accesses are 
done in a stride one manner, then no int entenemee will ceée@Gis 
unless the number of accessed data is bigger than the cache 
size. Otherwise, since the access pattern of a stride one or 
a constant stride array is uniform, its self-interference 
pattern also becomes uniform and increases the cache misses. 
In order to avoid self interference totally, "the largest 
block size that does not suffer from any self-interference 


fea)" a2Seeompueed. 


B. THE CRITICAL BLOCKING FACTOR 

Lam, Rothberg, and Wolf have developed an algorithm to 
find the largest square block that avoids self-interference 
for a given matrix size. This algorithm tailors the bilochimsg 
factor according to the problem size, 1.€., matrix sS1Ze)ee 
improve the average miss rate [Ref. 9]. By doing so, they 
intend to improve the average miss rate and to reduce the 
variance. Since the periodicity in the addressing of a 
direct-mapped cache and the constant-stride accesses are 


obvious, it is relatively easy to determine B.. Their 
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determine the largest square block with no self-interference 


meschown im Figure 26. 


algorithm FindB(N, C : integer) return integer; 
maxWwidthn, addr, di, dy, xX, Y, N : integer; 
Mmamevidth = min(N, Cy; 
neg = N/2; 
while true do 


@ear = addr + C; 


ai addr div N; 


aj abs((addr mod N) - N/2); 


Man age == MIn(maxw2ath, dz) ) 
return min(maxWidth, di); 
else 
maxWidth = di >= min(maxWidth, dj; 
end while; 


end algorithm; 


Figure 26 
Lam’s Algorithm to compute the Critical 
Blocking Factor B, 
1. Sensitivity of B.with respect to Matrix Size 
B. obtained from Lam’s algorithm for some matrix sizes 
draws attention to its sensitivity with respect to matrix 
size. Figure 27 shows the critical blocking factor for a some 


matrix sizes between 293 and 323. 
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Figure 27 


B. versus Matrix Size. Solbourne $4000 SPARC Station 


The critical blocking factor B,.cannot be larger than SsqrieiGe 
so the run time of the algorithm is O(N/sGreé(C)) ket ae 

Figure 28 plots the performance of the data cache for the 
values shown in Figure 27. The performance is greatly 


affected by value of B,, dropping to 1.7 Mflops for a 32 ieee 
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Figure 28 
Data Cache Performance versus Matrix Size. B, is 
obtained by using Lam’s Algorithm for Matrix Size. 
Solbourne S4000 SPARC Station. 
matrix and jumping to 2.0 Mflops (an increase of 17%) fora 
Syexs23 matrix. 
2. Effect of Declared Matrix Size on B, 

It 1S important to obtain consistent performance from 
an algorithm. Small changes in the values of the inputs 
generally should not cause huge swings in the performance of 
the algorithm. Users tend to use arbitrary array sizes larger 
than maximum expected problem size. 
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Tf we use Lam’s Algorithm which changes the block size 
by searching for a better B, then, for a given matrix size, 
we can keep increasing the size of the matrix and determine 
another B. until we find a block size which gives a better 
performance, preferably sgqrt(C) where C is an integer. 

In Figure 29, the performance variation is shown for 
actual matrix sizes of 293, 295, and 300, where B.is computed 
while extending the size of the matrix. Here, wé demonstrate 
the sensitivity of Lam’s Algorithm to the changes on the 
actual matrix size. Even a small change on one dimension of 
array b[k][3] affects the performance of the code blocked with 
a blocking factor for a matrix size as big as that new 
dimension. For example, when we use actual matrix size of 293 
without any change on any dimension, data cache performance is 
2.7 Mflops. If we declare one dimension of the same matrix to 
be 310, then the performance jumps up to 3.05 Mflops “(amg 
increase of 13%). For dimension of 321, the performance 
decreases from 3.05 Mflops to 2.2 (a decrease of 38%). 

Since calculation of B. 1s sensitive to this small 
change, we can draw insights to this fact as follows: 

1. Block size is highly sensitive to matrix size. 

2. Some bigger block sizes after this change shows 

that cache is not filled as much as possible with 
the block size computed with the algorithm. In 
other words, the bigger portion of cache used, 
the better cache performance for some array 
sizes. As long as the in the cache, in a certain 


range, if we increase the block size, it helps 
decreasing misses. The following equation in 
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Figure 29 
Extended Dimension versus Performance for Actual 
Matrix Sizes of 293, 295, and 300. Solbourne S4000 
Sparc Station 
Figure 30 models the miss rate in terms of parameters used in 


the blocked matrix multiplication and calculates the total 


number of cache misses [Ref. 9]. 


« N’[2/B + Si(b) + 3(1 - Si(b))B/C + B/C] 


where N° = the number of operations performed, 
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B = bileGcking —faceow 
Cc eAeneows mace 
Si(b)= (1 - B/C)®! = self-interference of 


accessing B-1 rows of array bD 


before the same data is reused. 


Figure 30 


Equation to compute total number of misses in cache 


According to this equation, there are N*(2/B) compulsory 
misses, misses that are intrinsic to the algorithm given the 
blocking factor and cannot be avoided even if the address 
mapping 1s perfect. The factor Si(b) 1s due to self inter- 
ference among the elements in the b array. Any blockage 
factor lower than the critical blocking factor does not cause 
self interference. The other two terms are due to cross 
interference between different variables. Example 5 below is 


presented to quantitavely demonstrate the effect of B,. 


Example 5. 

If actual matrix size = 293 and €”"="256, (Cheng =a 
computed with Lam’s Algorithm presented in Figure 26. With 
these values, the total number of misses = 1.17N’. If we 
fix the cache size and extend the declared matrix size to 
304 as determined with our seacrh technique, then B. = 16 is 
computed with these values, the total number of misses = 
0.68N°?. We showed that the total number of misses for a 
block size obtained with Lam’s Algorithm 1S worse than that 
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PEeOMEalmed with the extended matrix size (1.17/0.68 = 1.72, 
72% degrades the performance). Therefore, having a bigger 
blocking factor and an extended matrix dimension leads to 
a better data cache performance, while decreasing the miss 
uBleS- 
C. CHANGING ARRAY SIZE VIA A SEARCH TECHNIQUE AND DETERMINING 
THE CRITICAL BLOCKING FACTOR 
We present a search technique to find values of B. larger 
than those computed using the algorithm in [Ref. 9]. Tala 
technique computes B. by searching through all blocking 
factors for the range of matrix sizes up to 10% bigger than 
the actual matrix size. It mutually employs the algorithm in 
Figure 26 until it returns the biggest blocking factor, less 
than or equal to sqrt(C), where C 1s cache size, and bigger 
than 1, which is the smallest blocking factor that we can 
assign. We demonstrated that this blocking factor improves 
data cache performance as shown in Example 5. The following 


C language code in Figure 31 illustrates this technique. 


faeeaigorithm COMPUTE_B */ 
me@erine min(a,b) ((a<b) ? a:b)” 7* min function */ 
mae algo(int, double); 
main () 
{ 
moe k, N, Nmax, B, Bmax=1, i, Niimit; 
double C; 
fmeintcrt(" Bnter the size of the matrix: \n"); 
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scanf("%d", &N); 

printf(" Enter the size of the cache a7) 

SGanl (seer 

Nmax = IN; 

gata ec Cea els, /* max allowable matrix size increase */ 


Na ae N + K> of #omex wmetrix si zeus 


for (2 = N; 2 <=] Nigmic- 471) 
lee—etcefetgNt. — Syl. 
aif ((B <=-sqrt(C)). 46 (fe = oe 
Binax = 155 


Niax c= -N- 


! 
printf("New Matrix Size= @d\n", Nmax); 


Drintt ( BileocwineG Paceou— 2041 7 ones 


int algo(N, C) /* Lam’s Algorithm (algeorveae 27a) 
img 2 Oure 2 47 
Figure 31 
A search technique coupled with Lam’s Algorithm 
tov famnd. 5. 
The technique keeps extending the actual matrix size within 
the limit of 10% of the actual matrix size. For every s12Zé, 


it computes B. by repetitively applying Lam’s Algorithm in 
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Figure 26, and retrieves the first biggest blocking factor and 
the corresponding extended matrix size. 

Here the choice of 10% is arbitrary. In practice, we do 
not need to extend both rows and columns of the actual matrix. 
Because, the advantage of this approach is that it finds an 
extended matrix size which results in a better data cache 
performance than that of Lam’s Algorithm. Miss rates obtained 
from the equation in Figure 30 by substituting the values for 
two different matrix sizes show that matrix size and blocking 
factor from our algorithm present a better total number of 
misses. Moreover, in Example 5, cache portion used for the 


Ibocking factor computed with Lam’s Algorithm is 


mex /) / 2756 = 0.19 --> 19% of the cache, 


while the blocking factor computed with our technique allows 


(16 x 16) / 256 = 1.00 --> 100% of the cache to be utilized. 


Therefore, the technique presented above also enables us to 
utilize data cache optimally. However, this approach suffers 
from using additional columns/rows of a matrix. 
D. A NEW ALGORITHM TO FIND THE CRITICAL BLOCKING FACTOR WITH 
NO SELF-INTERFERENCE 
The drawback of the search technique in the previous 
section 1s that 1t uses Lam’s algorithm which has a complexity 
mere OIiN/sqrt(C)) [Ref. 9]. For each extended matrix s1i1Ze, 


using this algorithm becomes an overhead for the search 
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technique. Additionally, we have no guarantee for an 
acceptable choice of B.. 

In this section, we introduce a new algorithm which 
improves data cache performance by determining the critical 
blocking factor with no self-interference. The algorithm is 


shown in Figure 32. 


J “DIRECT AlCor? came, 

Die eoeco (imks, Jae. 

main () 

{ 
ant SIZE CC, 9Cl) Oboe S? 7e aes = er ee eer 
OGD SI ZE~= oO 1ZE; 


if (sqrt(C) is not an integer ) 


Ce = (C7245 
else 
‘gee SG. 


\ 


X = sqrt(C)-(SIZE # sgrt(C’));/* these two finesse = 
ak ie eee, 
S148 = (OZ eae /* to be a multiplete- 
SQt EG) aa 
Gout 
GCD = gcd((SiIZE/sqrt (C’)) -wsGn. (Gane /* gcd tesime 
Zt WUGCD =) 
B= %s0rene 2) 
else 
SIZE = SIZE + sqrt(C’); 
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Peigerie (B /= sqré(C’)); 
med (nba Vly ont. V2je 1 
int temp; 
while (V2) { 
Eenmr= V2; 
Vin Vilas yn 
Vl = temp; 
} 


meeeurn Vi->; 


Figure 32 


A New Algorithm to find B, 


Direct Algorithm allows us to use sqrt(C) as optimal blocking 
factor if sqrt(C) 1S an integer all the time. If not, the 
optimal blocking factor is considered as sqrt(C/2), because we 
@ieedin a multiple of two for C/2. The underlying principle is 
to cover data cache as much as possible at the expense of 
potentially using more memory. 

We assume that the matrix has sqrt(C) elements, where C 1s 
the cache size. If each element in the matrix, which we call 
it super_ element E, contains sqrt(C) elements with a stride 
S between them and if cache is divided by the size of sqrt(C) 
Super_elements, then (C / E) corresponds to the number of 


blocks in the cache. So, if S and (C / E) are relatively 


Syl 


prime, which also means that their greatest common divisor 
(gcd) is 1, then every super_element will be placed in a new 
super_location up to (C / E) different Super lo@ae wen 

The advantages of Direct Algorithm: 

- The time consumed to compute the extended matrix size and 
the corresponding B.1is shorter than the one in Lam’s 
Algorithm. Therefore, we save time. 

- The blocking factor is always sqrt(C) or sqrt(C/2) due to 
the value of cache size, at which we get a better 
performance than that of Lam’s Algorithm. 

The disadvantages of Direct Algorithm: 

- As in the previous section, this algorithm also suffers 
from using additional rows/columns which is an overhead 
for the computation. The maximum addition in size is 2 * 
SORT(C) -1 (1£ SORT(C) 18 an integer). 

This algorithm is experimented on Solbourne S4000 SPARC 
station and DECstation 3100. Figure 33 plots the performance 
of the data cache for a wide range of matrix sizes on SPARC 
Station. Figure 34 plots the actual matrix size versus the 
extended matrix size obtained with Direct Algorithm. 


This algorithm finds a matrix size suitable to be used 


with a block size of sqrt(C) if 1t 18 an integer (or Samii 


1f not). Since cache size is 512 words for a two-way associa- 
tive data cache in SPARC station, sqrt(5l2/2) = sqrie(Zac) ame 
1s an integer. So, we use almost 100% of the cache. The 


above performance curve in the graph shows the performance for 
a Matrix size determined by Direct Algorithm with a block size 


of sqrt(C). It presents a better performance level compared 
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Figure 33 


Performance levels with optimal block size of 16 
(without any change on matrix size) and with a block 
size of sqrt(C) (for extended matrix size determined 
by the algorithm above) on the data cache of Solbourne 
S4000 SPARC station. 
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Figure 34 
The Actual Matrix Size versus the Extended Matrix 


Size obtained with Direct Algorithm. Solbourne 
S4000 SPARC station 
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to the one below and behaves very consistent. The variation 
in performance for the one below is due to the interference 
misses in the data cache. 

In the experiment on DECstation 3100, three matrix 
sizes(293, 295, 300) are used and performance level of the 
data cache is observed for the values obtained with the new 
algorithm and the best blocking factors obtained previously. 

Figure 35 shows the two performance levels of the data 
cache on DECstation 3100. 

Figure 36 plots the actual matrix size versus the extended 
Matrix size obtained with Direct Algorithm. DECstation 3100 
has 8K double word direct mapped cache where sqrt(C) is not an 
integer. So, Direct Algorithm gets half of the cache and 
determines the new matrix size shown in Figure 36 for sqrt 
(e/2) . 

Ctr algorithm performs its best if the number of distinct 
array references to data cache is less than or equal to the 
associativity of data cache. Otherwise, interference among 


the references may occur. 
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Figure 35 


Performance levels with optimal block sizes (with no change 
on matrix size) and with a block size of sqrt(C/2) (with 
matrix size determined by the algorithm above) for the data 
cache on DECstation 3100 
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Figure 36 
The Actual Matrix Size versus the 
Extended Matrix Size obtained with 
Direct Algorithm. DECstation 3100 
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IV. CONCLUSIONS AND RECOMMENDATIONS FOR 
FURTHER RESEARCH 

In Chapter I, we reviewed the notion of reuse and reported 
the importance of the number of cache misses for RISC proces- 
SOre., Since there 1s an amazingly large reuse of data in 
cache, we focused on some techniques, namely blocking 
techniques, which are used to reduce acne traffic “aaa 
exploit this data reuse. 

In Chapter II, we performed several experiments on 
blocking and presented the results similar to those in [Ref. 
9) for a better understanding of the effect of blocking on the 
performance of two different data caches. 

In Chapter III, we demonstrated the sensitivity of the 
algorithm in [{Ref. 9] and the sensitivity of blocking facwen 
obtained by that algorithm to the declared array size. a 
fact, this sensitivity shows that the performance of the 
algorithm is very unstable related to the size of the matrix. 
To remedy this problem, we introduced two algorithms: The 
first one 1S a search technique which finds values of B, 
larger than those computed using the algorithm in [{Ref. 9]. 
This technique computes B. by searching through all blocking 
factors for the range of matrix sizes up to 10% bigger than 
the actual matrix size. It iteratively employs the algorithm 
in [9] until it returns the biggest blocking factor, less tied 
or equal to sqrt(C), where C 1s cache size, and bigger than 1, 
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which is the smallest blocking factor that we can assign. The 
other 1s Direct Algorithm which improves the performance of 
workstations in some specific applications. This algorithm 
finds a matrix size suitable to be used with a block size of 
femere) 1f 1t 1S an integer (or sqrt(C/2) if not). It gives 
the best results when the number of distinct array references 
is less than or equal to the associativity of data cache. 
For further research we recommend looking at the 
algorithms in situations where the number of distinct forms of 
array references 1s greater than the cache associativity 


maector . 
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